Background

In the United States of America, the 100-pt rating system is commonly used to indicate the production quality and typicity (how well the wine’s trait represent the style and region it’s from) of wine. Wine Enthusiast offers reviews for wine with a score range from 80 to 100 points.

Wine Enthusiast’s 100-point wine-scoring scale

With around 130k wine reviews published on Wine Enthusiast, I hope to explore the data and address the following questions:

  1. Which province offers wine of highest average price?
  2. Which province had the most reviews? Which varieties had the most reviews in this province?
  3. Can ratings in points help estimate price?
  4. What are some of the most common terms used in positive reviews? Negative reviews?

Preprocessing Stage

The first step is to prepare the data for analysis.

Clean up environment and load packages

The packages used are tidyverse and wordcloud.

# clean up the RStudio environment 
rm(list = ls())

# load all packages here: `tidyverse`, `lubridate`, and all others used
library('tidyverse')#tidying data
library('wordcloud')#work with word cloud

Data Intake

The data was found on and downloaded from Mavenan Alytics (Source: Zack Thoutt, scraped from WineEnthusiast). The codes below sets up data from the downloaded csv files.

### Load wine data from wine_data.csv
Wine <- read_csv("wine_data.csv") #full data with ~130k rows

Data Inspection

This dataset has 129,971 rows and 14 variables (3 numerical variables and 11 categorical variables). Each row of the data appears to be a case of review for wine. There are several NA’s in the fields and the maximum price is a lot larger than 75% quantile, which suggests the necessity of data cleaning stage.

Wine %>%
  head(3) #the glimpse function will present the columns down the page and rows across
## # A tibble: 3 × 14
##      id country  descript…¹ desig…² points price provi…³ regio…⁴ regio…⁵ taste…⁶
##   <dbl> <chr>    <chr>      <chr>    <dbl> <dbl> <chr>   <chr>   <chr>   <chr>  
## 1     0 Italy    Aromas in… Vulkà…      87    NA Sicily… Etna    <NA>    Kerin …
## 2     1 Portugal This is r… Avidag…     87    15 Douro   <NA>    <NA>    Roger …
## 3     2 US       Tart and … <NA>        87    14 Oregon  Willam… Willam… Paul G…
## # … with 4 more variables: taster_twitter_handle <chr>, title <chr>,
## #   variety <chr>, winery <chr>, and abbreviated variable names ¹​description,
## #   ²​designation, ³​province, ⁴​region_1, ⁵​region_2, ⁶​taster_name
Wine %>%
  summary() #the summary function will summarize the table
##        id           country          description        designation       
##  Min.   :     0   Length:129971      Length:129971      Length:129971     
##  1st Qu.: 32492   Class :character   Class :character   Class :character  
##  Median : 64985   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 64985                                                           
##  3rd Qu.: 97478                                                           
##  Max.   :129970                                                           
##                                                                           
##      points           price           province           region_1        
##  Min.   : 80.00   Min.   :   4.00   Length:129971      Length:129971     
##  1st Qu.: 86.00   1st Qu.:  17.00   Class :character   Class :character  
##  Median : 88.00   Median :  25.00   Mode  :character   Mode  :character  
##  Mean   : 88.45   Mean   :  35.36                                        
##  3rd Qu.: 91.00   3rd Qu.:  42.00                                        
##  Max.   :100.00   Max.   :3300.00                                        
##                   NA's   :8996                                           
##    region_2         taster_name        taster_twitter_handle    title          
##  Length:129971      Length:129971      Length:129971         Length:129971     
##  Class :character   Class :character   Class :character      Class :character  
##  Mode  :character   Mode  :character   Mode  :character      Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##    variety             winery         
##  Length:129971      Length:129971     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

Data Cleaning

In the following parts, missing values, duplicated values, and extreme values in data are fixed or removed.

Missing Values

First, the missing value problem is evaluated and dealt with. According to the analysis, 107584 rows and 9 columns contain missing values, which is about 83% of the data. Hence, it’s not rational to delete all data contain missing values. The variables that are missing more than 10% are: designation, taster_name, taster_twitter_handle, region_1, and region_2.

  • Since the “region_1”, “region_2”, and “taster_twitter_handle” are not related to our analysis and they contain many missing values, these three columns will be dropped.
  • For “designation” and “taster_name”, since removing them will result in significant data loss, the missing values will be filled as “Unknown”.
  • Moreover, for variables with limited missing data, the rows that contain missing values will be dropped.
## Only keep relevant cols
Wine <- Wine %>%
  select(-c(region_1, region_2, taster_twitter_handle))

## Define cols we replace NA with 'Unknown'
unk_cols <- c("designation", "taster_name")
## Replace NAs with 'Unknown'
Wine[unk_cols] <- Wine[unk_cols] %>%
  replace_na(list(designation = 'Unknown', taster_name = 'Unknown'))
## Remove the rest rows with missing values
Wine <- 
  Wine %>%
  na.omit()
## Double check number of rows with missing values
#sum(rowSums(is.na(Wine)) != 0)
#nrow(Wine)

After these steps, we have 120915 rows of data, with is 7% less than original and acceptable.

Duplicate Values

Now there are no missing values, let’s check if there are any duplicated values. Since the ids are added afterwards to make each row unique, uniqueness is tested without the id column. There are 9378 duplicated rows, which will be removed and new ids will be assigned to the distinct data. As a result, we have 111537 entries of unique data.

## Count duplicated descriptions
Wine %>%
  select(-id) %>%
  duplicated() %>%
  sum()
## [1] 9378
## Remove duplicate values
Wine <- Wine %>%
  select(-id) %>%
  distinct()

## Re-assign ids
Wine <- Wine %>%
  mutate(id = 1:n())

Extreme Values

According to the summary table above, the price field has a maximum value of 3300, while the 70% quantile is 42.

To figure out the distribution of price, histogram of is plotted. The histogram looks extremely right skewed, which make sense because most wine have moderate price and a few of them will be a little more expensive.

hist(Wine$price, main = 'Histogram of Price', xlab = 'Price') #histogram of price

By taking the log transform, the histogram looks much better, but some outliers still seem to exist.

Wine$price_log <- log(Wine$price) #log transform of price
hist(Wine$price_log, main = 'Histogram of log(Price)', xlab = 'log(Price)') #histogram of log(price)

With more research, reviews with extremely high price values seem to be mistakes rather than real expensive wine. Hence, the approach to deal with these extreme values is to delete them from the dataset.

Price is 2013 on this website Price is 40+ on Buy now page

summary(Wine$price_log) #summarise log(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.386   2.833   3.219   3.317   3.738   8.102
iqr = 3.738-2.833 #calculate IQR
Wine <- Wine %>%
  filter(price_log >= 2.833-iqr*2 & price_log <= 3.738+iqr*2) #delete data with extreme price value

As a result, we have 111,103 rows of clean data.

Wine %>%
  nrow()
## [1] 111103

Data Analysis Stage

Now that the data is clean, I will analyze the data and try to answer four questions mentioned above.

Question 1

Which province offers wine of highest average price?

Answer: According to the data, Switzerland has the highest average price in wine reviews.

However, problem we see is that these top ranking provinces only have very few number of reviews for few wines. With too few observations, the mean value can be biased. Hence, limit the province to ones with at least 10 (types of) wines could help set a baseline for number of observations. For provinces with at least 10 wines, Champagne has the highest average price.

## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.

Question 2

Which province had the most reviews? Which varieties had the most reviews in this province?

Answer: According to the plot below, California is the most reviewed province, with more than 33k (~30%) reviews.

Point Noir, Cabernet Sauvignon, and Chardonnay are most reviewed in California.

Question 3

Can ratings in points help estimate price?

Answer Let’s first visualize the relationship between price and points. According to the scatterplot, the price does seem higher for higher points. However, the relationship doesn’t seem linear, but rather exponential.

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

By plotting points with log(price), the relationship seem approximately linear.

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

By Fitting a linear relationship between log(price) and points, we can see that the slop and intercept are both significant. The estimated prediction equation is: \[price = e^{0.127*points-7.897}\] That is, a 80 point wine is estimated to sell at price of 9.61. The R-squared value for this model is 37.32%. Hence, this model explain 37.32% of the variability in price, which indicates points can help estimate price, but in order to get a more accurate estimation, we may need more information.

## 
## Call:
## lm(formula = price_log ~ points, data = Wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.69148 -0.36461 -0.04212  0.32854  2.79268 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.8965288  0.0435857  -181.2   <2e-16 ***
## points       0.1267464  0.0004928   257.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5053 on 111101 degrees of freedom
## Multiple R-squared:  0.3732, Adjusted R-squared:  0.3732 
## F-statistic: 6.615e+04 on 1 and 111101 DF,  p-value: < 2.2e-16

Question 4

What are some of the most common terms used in positive reviews? Negative reviews?

AnswerConsidering the Wine Enthusiast’s 100-point wine-scoring scale, let’s define positive reviews by points equal to or higher than 98 and negative reviews by 8 less than or equal to 82.

Step 1 Define the word count function The first step is to define a function that convert the vector of descriptions into a dataset of the vocabularies used and their frequencies.

Step 2 Visualize the overall word frequency for reviews

Positive reviews:

Negative reviews:

Both plots above contain meaningless terms such as “and”, “the”, and so on. These non-relevant words will be removed before further analysis.

Step 3 Clean up the key words list using stop words

Positive reviews: In positive reviews, “black” and “dark” are usually used to describe color of the wine. Words such as “best”, “perfect”, “superb” indicates the taster really liked these wines.

Negative reviews: While some positive words such as “sweet” and “soft” shows up in negative reviews, “dry”, “bitter”, and “sour” seem to reveal the taste of these wine.

Overall, in positive reviews, the tasters mainly praised on the dark color and great overall quality. On the other hand, tasters usually criticize the bitter, dry, and sour taste of wine in negative reviews.

Conclusion

The following business insights are drawn based on the analysis: